The Hybrid Digital Tree and Its Applications to Genomic Sequence Databases
Author: Qiang Xue
Abstract
This dissertation focuses on index structures, search algorithms, and applications for large string databases whose indexes cannot fit entirely in main memory (RAM). String searching is a classic research topic that has received increasing attention in recent years, owing to the rapid growth of digital text collections (strings) and the expanding range and complexity of applications. Traditional string indexing approaches are either RAM-based or disk-based. RAM-based structures perform poorly when the index size exceeds the available RAM. Disk-based structures, on the other hand, do not take full advantage of the available RAM, which may result in excessive input/output (I/O) operations. In this dissertation, a novel indexing approach, the Hybrid Digital tree (HD-tree), is proposed. The HD-tree index contains two parts: the RAM-index and the disk-index. The RAM-index resides in RAM to minimize disk accesses, while the disk-index keeps the rest of the index on disk so that large databases can be indexed.
The first half of this dissertation focuses on index structures. The HD-tree is proposed after investigating existing indexing techniques. Construction and search algorithms for the HD-tree are developed, and characteristics of the tree structure are discussed. The HD-tree is applied to prefix and substring searches and is compared with the Prefix B-tree. The comparison shows that the HD-tree not only reduces I/O operations by a factor of two to three, but also reduces the total query processing time by one order of magnitude. The HD-tree is also applied to approximate string matching based on the Hamming distance, where it outperforms both the M-tree and the linear-scan approach.
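As a rough illustration of the two query types mentioned above (prefix search and approximate matching under the Hamming distance), the following sketch uses a plain in-memory digital tree (trie). It is not the HD-tree itself: the RAM-index/disk-index split is omitted, and the names `Trie`, `prefix_search`, and `hamming_search` are hypothetical, chosen only for this example.

```python
from typing import Dict, List, Tuple


class TrieNode:
    """One node of a plain digital tree (illustrative, not the HD-tree)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_key = False  # True if a stored string ends at this node


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, key: str) -> None:
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def prefix_search(self, prefix: str) -> List[str]:
        """Return all stored strings that start with `prefix`, in sorted order."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        found: List[str] = []

        def collect(n: TrieNode, path: str) -> None:
            if n.is_key:
                found.append(path)
            for c in sorted(n.children):
                collect(n.children[c], path + c)

        collect(node, prefix)
        return found

    def hamming_search(self, query: str, max_dist: int) -> List[Tuple[str, int]]:
        """Return (string, distance) for stored strings of the same length as
        `query` whose Hamming distance to it is at most `max_dist`."""
        found: List[Tuple[str, int]] = []

        def walk(n: TrieNode, depth: int, dist: int, path: str) -> None:
            if depth == len(query):
                if n.is_key:
                    found.append((path, dist))
                return
            for c in sorted(n.children):
                d = dist + int(c != query[depth])
                if d <= max_dist:  # prune branches that already exceed the bound
                    walk(n.children[c], depth + 1, d, path + c)

        walk(self.root, 0, 0, "")
        return found


trie = Trie()
for s in ["ACGT", "ACGA", "ACCT", "TCGT", "AC"]:
    trie.insert(s)

print(trie.prefix_search("ACG"))      # ['ACGA', 'ACGT']
print(trie.hamming_search("ACGT", 1))
# [('ACCT', 1), ('ACGA', 1), ('ACGT', 0), ('TCGT', 1)]
```

The pruning step in `hamming_search` is what lets a tree index avoid comparing the query against every stored string, in contrast to a linear scan.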
In the second half of this dissertation, the HD-tree is applied to indexing and searching genomic sequence databases, such as the entire GenBank protein sequence database. Because the GenBank data is massive, generating an HD-tree index with the standard method takes dozens of hours. The Sort-Merge method is therefore proposed to reduce the construction time by an order of magnitude. Sequence search algorithms using scoring matrices are developed for the HD-tree. Compared with BLAST, a popular sequence search tool, the HD-tree not only reduces query time by a factor of four, but also finds more valid results for short queries. Finally, the HD-tree is applied to sequence searches using the Profile Hidden Markov Model (PHMM). Compared with HMMER, one of the most popular PHMM search tools, the HD-tree is orders of magnitude faster for short queries.
In the appendix, research on approximate q-gram matching in genomic sequence databases is presented. It is shown that searching genomic sequence databases using a longer query word length and a larger Hamming distance in the filtering stage provides an excellent opportunity to optimize the search cost while improving the quality of the search. This result provides further support and motivation for developing advanced indexing schemes, such as the HD-tree, for large genomic sequence databases.
In summary, this dissertation not only develops a new tree structure for string indexing, but also successfully applies the structure to real applications. According to comparisons with existing techniques, the proposed data structure, the HD-tree, is promising for indexing and searching large string databases, especially genomic sequence databases.
Acknowledgements
Completing this dissertation has been the biggest challenge of my life. Without the help of numerous people, I could not have succeeded. My first thanks go to my adviser, Dr. Sakti Pramanik, who committed years to training me and helping me become a qualified researcher. He never gave up on me, although I was slow in learning. Many times, he invited me to his house on weekends to provide extra help. He offered valuable insight and suggestions in guiding my research direction. He read my paper and dissertation drafts and corrected my grammatical errors without complaint. I am also thankful to Dr. Pramanik's family, who welcomed me warmly whenever I visited and treated me to many nice meals.
My thanks also go to Dr. James Cole, a member of my PhD committee. As a biologist, Dr. Cole provided the knowledge and insight I needed to apply the HD-tree to genomic sequence databases. He also offered many suggestions on finding applications and designing experiments. Without his help, the second half of the dissertation would not have been possible. I am also thankful to Dr. Qiang Zhu and Dr. Gang Qian, who helped me develop the structure and algorithms of the HD-tree and publish the research. I appreciate the help of the other members of my PhD committee: Dr. Herman Hughes, Dr. William McCarthy, and Dr. Jon Sticklen. They encouraged me throughout the process of finishing this dissertation and offered valuable revision suggestions. I am especially thankful that Dr. Hughes, though retired, flew from Atlanta to attend my defense.
I am indebted to my parents for encouraging me to pursue this degree and supporting me at every step along the way. They sacrificed much more than most parents to help me succeed. I shared many tears and laughs over this dissertation with my brothers and sisters in Christ. I cannot imagine how I could have continued the study without their prayers and encouragement. Among them, Caleb Kelly and